Aprentissage non supervisé¶

l'apprentissage Non Supervisé consiste à utiliser des données non étiquetées et utiliser des algorithmes de clustering qui vont fouiller dans les données pour chercher à les segmenter en un ensemble de groupes homogènes en se basant sur des mesures de similarités.

  • Exemple d'algorithmes de clustering : KMeans

image.png

  • L'algorithme des k-means (ou bien k-moyennes) est l’un des algorithmes d’apprentissage non supervisé qui permet d’analyser un jeu de données afin de regrouper les données « similaires » en groupes (ou clusters).
  • Ce principe est applicable dans divers domaines de l'industrie comme le domaine du marketing pour la segmentation des clients, analyse des réseaux sociaux, segmentation des images médicales.
  • Principe de Fonctionnement :

image.png

image.png

In [1]:
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt

image.png

image.png

Data Set¶

In [2]:
df = pd.read_csv('salaires.csv')
In [3]:
df
Out[3]:
Name Age Salary
0 Ines 29 4900
1 Omar 32 5300
2 Sabrina 40 6500
3 Souad 41 6300
4 Nezha 43 6400
5 Rabia 39 8000
6 Soufiane 41 8200
7 Abdellah 39 5800
8 Mohamed 27 7000
9 Yassine 29 9000
10 Azizi 38 16000
11 Ahmed 36 15600
12 Yasmine 35 13000
13 Aya 37 13700
14 Zakaria 26 4500
15 Fatima 27 4800
16 Salim 28 5100
17 Imane 29 6000
18 Ismail 28 6000
19 Malak 42 15000
20 Hanane 39 15500
21 Ibrahim 41 16000
In [39]:
plt.scatter(df['Age'], df['Salary'])
plt.xlabel("Age")
plt.ylabel("Salary")
Out[39]:
Text(0, 0.5, 'Salary')
No description has been provided for this image

Clustering avec K-Means¶

In [5]:
km = KMeans(n_clusters=3, random_state=0)
In [6]:
X = df.drop(columns=['Name'])
In [7]:
km.fit(X)
Out[7]:
KMeans(n_clusters=3, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=3, random_state=0)
In [8]:
km.labels_
Out[8]:
array([1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
      dtype=int32)
In [9]:
predicted = km.predict(X)
In [10]:
predicted
Out[10]:
array([1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
      dtype=int32)
In [11]:
km.labels_==predicted
Out[11]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True])
In [12]:
df['cluster']= predicted
In [13]:
df
Out[13]:
Name Age Salary cluster
0 Ines 29 4900 1
1 Omar 32 5300 1
2 Sabrina 40 6500 1
3 Souad 41 6300 1
4 Nezha 43 6400 1
5 Rabia 39 8000 2
6 Soufiane 41 8200 2
7 Abdellah 39 5800 1
8 Mohamed 27 7000 2
9 Yassine 29 9000 2
10 Azizi 38 16000 0
11 Ahmed 36 15600 0
12 Yasmine 35 13000 0
13 Aya 37 13700 0
14 Zakaria 26 4500 1
15 Fatima 27 4800 1
16 Salim 28 5100 1
17 Imane 29 6000 1
18 Ismail 28 6000 1
19 Malak 42 15000 0
20 Hanane 39 15500 0
21 Ibrahim 41 16000 0
In [14]:
cluster1 = df[df['cluster']==0]
cluster2 = df[df['cluster']==1]
cluster3 = df[df['cluster']==2]
In [15]:
plt.scatter(cluster1['Age'], cluster1['Salary'], color='red')
plt.scatter(cluster2['Age'], cluster2['Salary'], color='blue')
plt.scatter(cluster3['Age'], cluster3['Salary'], color='green')
Out[15]:
<matplotlib.collections.PathCollection at 0x29feaf390>
No description has been provided for this image

Normalisation et standardisation des données¶

image.png

In [16]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
In [17]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
In [18]:
X_scaled
Out[18]:
array([[0.17647059, 0.03478261],
       [0.35294118, 0.06956522],
       [0.82352941, 0.17391304],
       [0.88235294, 0.15652174],
       [1.        , 0.16521739],
       [0.76470588, 0.30434783],
       [0.88235294, 0.32173913],
       [0.76470588, 0.11304348],
       [0.05882353, 0.2173913 ],
       [0.17647059, 0.39130435],
       [0.70588235, 1.        ],
       [0.58823529, 0.96521739],
       [0.52941176, 0.73913043],
       [0.64705882, 0.8       ],
       [0.        , 0.        ],
       [0.05882353, 0.02608696],
       [0.11764706, 0.05217391],
       [0.17647059, 0.13043478],
       [0.11764706, 0.13043478],
       [0.94117647, 0.91304348],
       [0.76470588, 0.95652174],
       [0.88235294, 1.        ]])
In [52]:
scaler.inverse_transform([[0.88235294, 1.       ]])
Out[52]:
array([[   40.99999998, 16000.        ]])

Pipe Line de Modèles¶

In [136]:
estimator = Pipeline([
    ("MinMaxScaler", MinMaxScaler()),
    ("clustering", KMeans(n_clusters=3, random_state=0))
]
)
In [20]:
estimator.fit(X)
Out[20]:
Pipeline(steps=[('MinMaxScaler', MinMaxScaler()),
                ('clustering', KMeans(n_clusters=3, random_state=0))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('MinMaxScaler', MinMaxScaler()),
                ('clustering', KMeans(n_clusters=3, random_state=0))])
MinMaxScaler()
KMeans(n_clusters=3, random_state=0)
In [22]:
predicted2 = estimator.predict(X)
In [ ]:
df['cluster2'] = predicted2
In [24]:
cluster1 = df[df['cluster2']==0]
cluster2 = df[df['cluster2']==1]
cluster3 = df[df['cluster2']==2]
In [55]:
centers_scaled = estimator[-1].cluster_centers_
centers = estimator[0].inverse_transform(centers_scaled)
In [56]:
centers_scaled
Out[56]:
array([[0.72268908, 0.91055901],
       [0.1372549 , 0.11690821],
       [0.85294118, 0.2057971 ]])
In [57]:
centers
Out[57]:
array([[   38.28571429, 14971.42857143],
       [   28.33333333,  5844.44444444],
       [   40.5       ,  6866.66666667]])
In [70]:
plt.scatter(cluster1['Age'], cluster1['Salary'], color='red', label = "cluster 1")
plt.scatter(cluster2['Age'], cluster2['Salary'], color='blue', label = "cluster 2")
plt.scatter(cluster3['Age'], cluster3['Salary'], color='green', label = "cluster 3")
plt.scatter(centers[:,0], centers[:,1], color="black", marker='*', s=100, label="Cluster Centers")
plt.xlabel("Age")
plt.ylabel("Salary")
plt.legend()
Out[70]:
<matplotlib.legend.Legend at 0x2aa9b9d10>
No description has been provided for this image

Elbow et Silohouette pour trouver le nombre optimum de clusters¶

image.png

In [36]:
from sklearn.metrics import silhouette_score
sse=[]
sil=[]
krange = range(1,10)
for k in krange:
    km = Pipeline([
        ("MinMaxScaler", MinMaxScaler()),
        ("clustering", KMeans(n_clusters=k, random_state=0))
        ]
        )
    km.fit(X)
    sse.append(km[-1].inertia_)
    if(k==1):
        sil.append(0)
    else :    
        sil.append(silhouette_score(X, km[-1].labels_))
    
In [72]:
sse
Out[72]:
[5.522140987850921,
 2.139183706762151,
 0.4796830920366221,
 0.3546287126734011,
 0.26819371493732613,
 0.22690756427112166,
 0.18904083568265526,
 0.18057431161055554,
 0.14689918738539565]
In [73]:
sil
Out[73]:
[0,
 np.float64(0.25657476200964113),
 np.float64(0.35398455359639963),
 np.float64(0.1995366438963641),
 np.float64(0.19160118123962888),
 np.float64(0.0694716270838743),
 np.float64(-0.020201042966170576),
 np.float64(0.1380407992851944),
 np.float64(0.2576952732350539)]
In [37]:
plt.plot(krange, sse,)
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
Out[37]:
Text(0, 0.5, 'SSE')
No description has been provided for this image
In [71]:
plt.plot(krange, sil,)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette")
Out[71]:
Text(0, 0.5, 'Silhouette')
No description has been provided for this image

Principal Component Analysis (PCA)¶

image.png

In [74]:
from sklearn.datasets import load_digits
import pandas as pd
In [75]:
dataset = load_digits()
In [76]:
dataset.keys()
Out[76]:
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
In [83]:
data = dataset.data
In [85]:
data.shape
Out[85]:
(1797, 64)
In [86]:
data[0]
Out[86]:
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
In [87]:
data[0].reshape(8,8)
Out[87]:
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
In [94]:
from matplotlib import pyplot as plt
In [95]:
plt.gray()
plt.matshow(dataset.data[0].reshape(8,8))
Out[95]:
<matplotlib.image.AxesImage at 0x2aad07390>
<Figure size 640x480 with 0 Axes>
No description has been provided for this image
In [99]:
plt.matshow(dataset.data[5].reshape(8,8))
Out[99]:
<matplotlib.image.AxesImage at 0x2aacdbd90>
No description has been provided for this image
In [103]:
dataset.target[:20]
Out[103]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [104]:
df = pd.DataFrame(data=data, columns=dataset['feature_names'])
In [105]:
df
Out[105]:
pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 pixel_0_4 pixel_0_5 pixel_0_6 pixel_0_7 pixel_1_0 pixel_1_1 ... pixel_6_6 pixel_6_7 pixel_7_0 pixel_7_1 pixel_7_2 pixel_7_3 pixel_7_4 pixel_7_5 pixel_7_6 pixel_7_7
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 6.0 13.0 10.0 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 11.0 16.0 10.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 5.0 0.0 0.0 0.0 0.0 3.0 11.0 16.0 9.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 9.0 0.0 0.0 0.0 7.0 13.0 13.0 9.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 2.0 16.0 4.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1792 0.0 0.0 4.0 10.0 13.0 6.0 0.0 0.0 0.0 1.0 ... 4.0 0.0 0.0 0.0 2.0 14.0 15.0 9.0 0.0 0.0
1793 0.0 0.0 6.0 16.0 13.0 11.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 6.0 16.0 14.0 6.0 0.0 0.0
1794 0.0 0.0 1.0 11.0 15.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 2.0 9.0 13.0 6.0 0.0 0.0
1795 0.0 0.0 2.0 10.0 7.0 0.0 0.0 0.0 0.0 0.0 ... 2.0 0.0 0.0 0.0 5.0 12.0 16.0 12.0 0.0 0.0
1796 0.0 0.0 10.0 14.0 8.0 1.0 0.0 0.0 0.0 2.0 ... 8.0 0.0 0.0 1.0 8.0 12.0 14.0 12.0 1.0 0.0

1797 rows × 64 columns

In [106]:
df.describe()
Out[106]:
pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 pixel_0_4 pixel_0_5 pixel_0_6 pixel_0_7 pixel_1_0 pixel_1_1 ... pixel_6_6 pixel_6_7 pixel_7_0 pixel_7_1 pixel_7_2 pixel_7_3 pixel_7_4 pixel_7_5 pixel_7_6 pixel_7_7
count 1797.0 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 ... 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000
mean 0.0 0.303840 5.204786 11.835838 11.848080 5.781859 1.362270 0.129661 0.005565 1.993879 ... 3.725097 0.206455 0.000556 0.279354 5.557596 12.089037 11.809126 6.764051 2.067891 0.364496
std 0.0 0.907192 4.754826 4.248842 4.287388 5.666418 3.325775 1.037383 0.094222 3.196160 ... 4.919406 0.984401 0.023590 0.934302 5.103019 4.374694 4.933947 5.900623 4.090548 1.860122
min 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.0 0.000000 1.000000 10.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 11.000000 10.000000 0.000000 0.000000 0.000000
50% 0.0 0.000000 4.000000 13.000000 13.000000 4.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 0.000000 0.000000 0.000000 4.000000 13.000000 14.000000 6.000000 0.000000 0.000000
75% 0.0 0.000000 9.000000 15.000000 15.000000 11.000000 0.000000 0.000000 0.000000 3.000000 ... 7.000000 0.000000 0.000000 0.000000 10.000000 16.000000 16.000000 12.000000 2.000000 0.000000
max 0.0 8.000000 16.000000 16.000000 16.000000 16.000000 16.000000 15.000000 2.000000 16.000000 ... 16.000000 13.000000 1.000000 9.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000

8 rows × 64 columns

Logistic Regression¶

In [107]:
X = df
y = dataset.target
In [114]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
In [109]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled
Out[109]:
array([[ 0.        , -0.33501649, -0.04308102, ..., -1.14664746,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  0.54856067,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  1.56568555,
         1.6951369 , -0.19600752],
       ...,
       [ 0.        , -0.33501649, -0.88456568, ..., -0.12952258,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -0.67419451, ...,  0.8876023 ,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649,  1.00877481, ...,  0.8876023 ,
        -0.26113572, -0.19600752]], shape=(1797, 64))
In [111]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=30)
In [113]:
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
Out[113]:
0.9722222222222222

PCA (95%)¶

In [115]:
pca = PCA(0.95)
X_pca = pca.fit_transform(X)
X_pca.shape
Out[115]:
(1797, 29)
In [117]:
evr=pca.explained_variance_ratio_
In [118]:
evr
Out[118]:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415,
       0.0491691 , 0.04315987, 0.03661373, 0.03353248, 0.03078806,
       0.02372341, 0.02272697, 0.01821863, 0.01773855, 0.01467101,
       0.01409716, 0.01318589, 0.01248138, 0.01017718, 0.00905617,
       0.00889538, 0.00797123, 0.00767493, 0.00722904, 0.00695889,
       0.00596081, 0.00575615, 0.00515158, 0.0048954 ])
In [119]:
evr.sum()
Out[119]:
np.float64(0.9547965245651596)
In [120]:
pca.n_components_
Out[120]:
np.int64(29)
In [124]:
X_pca
Out[124]:
array([[ -1.25946645, -21.27488348,   9.46305462, ...,   3.67072108,
          0.9436689 ,   1.13250195],
       [  7.9576113 ,  20.76869896,  -4.43950604, ...,   2.18261819,
          0.51022719,  -2.31354911],
       [  6.99192297,   9.95598641,  -2.95855808, ...,   4.22882114,
         -2.1576573 ,  -0.8379578 ],
       ...,
       [ 10.8012837 ,   6.96025223,  -5.59955453, ...,  -3.56866194,
         -1.82444444,  -3.53885886],
       [ -4.87210009, -12.42395362,  10.17086635, ...,   3.25330054,
         -0.95484174,   0.93895602],
       [ -0.34438963,  -6.36554919, -10.77370849, ...,  -3.01636722,
         -1.29752723,  -2.58810313]], shape=(1797, 29))
In [125]:
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=30)
In [126]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
Out[126]:
0.9722222222222222

PCA (2 Composantes)¶

In [131]:
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(X)
X_pca2.shape
Out[131]:
(1797, 2)
In [132]:
X_pca2
Out[132]:
array([[ -1.25946645, -21.27488348],
       [  7.9576113 ,  20.76869896],
       [  6.99192297,   9.95598641],
       ...,
       [ 10.8012837 ,   6.96025223],
       [ -4.87210009, -12.42395362],
       [ -0.34438963,  -6.36554919]], shape=(1797, 2))
In [133]:
pca2.explained_variance_ratio_
Out[133]:
array([0.14890594, 0.13618771])
In [134]:
pca2.explained_variance_ratio_.sum()
Out[134]:
np.float64(0.285093648236993)
In [135]:
X_train_pca2, X_test_pca2, y_train2, y_test2 = train_test_split(X_pca2, y, test_size=0.2, random_state=30)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca2, y_train2)
model.score(X_test_pca2, y_test2)
Out[135]:
0.6083333333333333